ABSTRACT

A variety of fan-based wikis about episodic fiction (e.g.,
television shows, novels, movies) exist on the World Wide Web.
These wikis provide a wealth of information about complex
stories, but readers who are behind in their viewing run
the risk of encountering “spoilers”: information that gives
away key plot points before the time intended by the show’s
writers. Enterprising readers might browse the wiki in a web
archive so as to view the page as it existed prior to a specific
episode date and thereby avoid spoilers. Unfortunately, due to
how web archives choose the “best” page, it is still possible
to see spoilers (especially in sparse archives).

In this paper we discuss how to use Memento to avoid spoilers.
Memento uses TimeGates to determine which archived page is
the best one to return to the user, currently using a
minimum distance heuristic. We quantify how this heuristic
is inadequate for avoiding spoilers, analyzing data collected
from fan wikis and the Internet Archive. We create
an algorithm for calculating the probability of encountering
a spoiler in a given wiki article. We conduct an experiment
with 16 wiki sites for popular television shows. We find that
38% of those pages are unavailable in the Internet Archive.
We find that when accessing fan wiki pages in the Internet
Archive there is as much as a 66% chance of encountering a
spoiler. Using sample access logs from the Internet Archive,
we find that 19% of actual requests to the Wayback Machine
for wikia.com pages ended in spoilers. We suggest the use of
a different minimum distance heuristic, minpast, for wikis,
using the desired datetime as an upper bound.

1. INTRODUCTION

From How I Met Your Mother to Game of Thrones, fans
have created fan-based wikis based on their favorite episodic
fiction. For a community of fans these wikis become the focal
point for continued discussion and documenting the details
of the fictional milieu. The first study on fan wikis was done
on the wiki Lostpedia, for the show Lost [15].

Unfortunately, due to the rise in the availability of recorded
media, viewing television shows after their air date has
become more common, making the use of these wikis difficult
for those who have not yet consumed all of the episodes to
date, leading to spoilers. Spoilers are defined as pieces of
information whose time and place of consumption a user wants
to control, preferring to consume them in the order
that the author (or director) intended. If these pieces of
information are delivered in the wrong order, enjoyment of
a movie or television program is damaged [9]. The problem
of spoilers has been reported in popular media for years,
from such sources as CNN |T] and The New York Times [§].

The Memento Framework [21, 22] can be used to avoid spoilers
on the web [11, 12]. Memento allows one to extend content
negotiation into the dimension of time, a process called
datetime negotiation, allowing a user to choose a date
prior to the episode they have not seen and view the web as
it looked at that time.

Memento provides several resource types that play a role
in datetime negotiation. First, the original resource, also
noted as a URI-R, is the page for which we want the past
version. In MediaWiki parlance, it is called a topic URI, and
refers to the wiki article in its current state. Then we have
the memento, from which the Memento Framework gets its
name, also noted as URI-M. It is the past version of the page.
In MediaWiki parlance, it is called an oldid page. Third, we
have the TimeMap, also noted as URI-T, which is a resource
associated with the original resource from which a list of
mementos for that resource is available. The TimeMap
provides a list of URI-Ms and datetimes in a well-defined
format, but does not contain any article content. Finally,
we have the TimeGate, also noted as URI-G, which is the
resource associated with the original resource that provides
datetime negotiation. It is the URI to which the user sends
a datetime and from which the user receives information about
which memento (URI-M) is the best match for that datetime.
The TimeGate only processes and redirects; it provides no
representations itself.

Wikis preserve every revision of a page as mementos,
accessible via a series of URI-Ms. The web archive then captures
some of those revisions as general mementos, accessible via a
different series of URI-Ms. Unfortunately, for a web archive,
there are missed updates that are never recorded, so we
are unsure of the interval for which any given general
memento is valid. For a wiki, we have every revision and no
missed updates, so we do know the interval of their
validity. This makes wiki revisions a special case of mementos.
For the sake of this paper, to differentiate between the two
sources, we use the term revision when referring to a
memento saved by a wiki, and the term memento when
referring to the more general mementos residing at a web archive.
The full discussion of models and durations of validity is
outside of the scope of this paper.

Figure 1 shows two timelines. The bottom timeline consists
of mementos captured by a web archive. The top timeline
consists of several revisions of a wiki page. From this figure, we
see that memento $m_k$ was archived by the web archive at
datetime $t_{14}$, which we denote as $m_k@t_{14}$, or, more
generically, $t_{m_k}$. Likewise, $r_{j-4}@t_2$ denotes the
time at which wiki revision $r_{j-4}$ was saved. Arrows
between the memento line and the revision line show which
mementos are captures of which revisions. We denote this
as $m_k \equiv r_j$, indicating that $m_k$ is a capture of $r_j$.
We see that revisions $r_{j-3}$, $r_{j-2}$, and $r_{j-1}$ are
never captured, making them missed updates.

In Figure 2 we add a third timeline for events, showing the
pattern observed by Steiner [20] where events inspire wiki
revisions to be created. In this case events correspond to
television episodes. As seen above, these edits are eventually
captured by the web archive. We use the nomenclature $t_{e_i}$
to refer to the time of the $i$th episode. We also use $e_1$
to refer to the first episode and $e_n$ to refer to the latest
(or last) episode.

For this paper, we define the term spoiler naively as any
memento that exists after the date desired by the user,
regardless of the content of the memento. Figure 3 illustrates
this concept using $e_i$ as the episode, and $r_j$ and
$r_{j+1}$ as revisions on either side of this event. Based on our
definition, revision $r_j$ is safe because it exists prior to
episode $e_i$ that the user is trying to avoid. It is assumed
that revision $r_{j+1}$ contains spoilers because that wiki edit
occurred after $e_i$.

It is this relationship between events and revisions that
allows our spoiler solution to work. Fans who edit wiki pages
typically have no knowledge of an episode’s content until
that episode airs, meaning that revisions containing that
information must come after the episode.

In determining the best memento to which the user should
be directed, web archives use a minimum distance heuristic.
We demonstrate that this heuristic is not useful for
avoiding spoilers. Fan wikis are a special case, because they are
updated frequently and many of their users want to avoid
spoilers. We do not seek to change the Internet Archive’s
processes. We can use Memento directly on wikis to avoid
spoilers because wikis have access to all revisions [14]. These
revisions are mementos in their own right, and because we
have all revisions, we can use a different heuristic
(minpast) that avoids mementos after the date requested by the
user. Thus, by using Memento directly on a wiki, one can
avoid spoilers in fan wikis. The Memento MediaWiki Extension
provides this functionality for MediaWiki [13], allowing
spoiler avoidance for those who install the extension.

As part of this temporal analysis, we will further define the
two heuristics under consideration, mindist and minpast.
We will also show that there is a 66% probability of
encountering a spoiler when using web archives to access prior
versions of wikis, because web archives use mindist. In
addition to not reliably helping users avoid spoilers, we find
that 38% of the pages in our sample are not available in the
archive.

Figure 1: Example Timeline Showing Captured Mementos
of Wiki Edits

Figure 2: Each event can inspire a new wiki revision
which may be captured as a memento by a web archive

Figure 3: Representation of our Naive Spoiler Concept

In this paper, we briefly show what others have done to 
study the spoiler problem, then we discuss what previous 
studies have been done on wikis. From here we discuss two 
different TimeGate heuristics and how minpast is preferred 
over mindist when a user is trying to avoid spoilers. Then we 
discuss how the mindist heuristic can lead to spoiler areas, 
where a user selects a datetime prior to an episode they 
want to avoid, but are still directed into the future. Using 
these spoiler areas, we then show how one can calculate the 
probability of encountering a spoiler for a given page. 













Armed with these concepts, we show the results of a study
performed on 16 fan wikis for popular television shows [12],
showing not only that these spoiler areas exist for users, but
also the probabilities of encountering spoilers in these sites.

We then discuss the results of a second study on logs from 
the Wayback Machine, showing that 19% of all requests end 
up in the future, indicating that the spoiler problem is real, 
and that the Wayback Machine is not a reliable tool for 
avoiding spoilers on the web. 

2. RELATED WORK 

Schirra, Sun, and Bentley conducted a study of two-screen
viewing while the television show Downton Abbey was
airing [19]. Two-screen viewing is a process whereby those
watching a television show episode discuss the show on a
social media web site, such as Twitter, while the episode is
airing. A similar study was conducted by Johns [10]. Both
studies discovered that users would use elaborate methods
to avoid revealing and encountering spoilers in social media
as well as the current versions of web sites.

Because of the phenomenon of spoilers in social media,
Boyd-Graber, Glasgow, and Zajac conducted an evaluation of
machine learning approaches to find spoilers in social media
posts [I]. They used classifiers on multiple sources to
determine which posts should be blocked. They mention that
spoilers refer to events “later than the viewer’s knowledge
of the current work”, suggesting that any machine learning
technique used for avoiding spoilers in social media must
be smarter than just blocking all posts about a particular
topic [6, 9]. This work inspired software packages
that block spoilers from a user’s social media feed, such as
Spoiler Shield [16] and the Netflix Spoiler Foiler [5].

We are proposing an orthogonal concept relating to fan wikis,
not social media. We are also not blocking resources; rather,
we indicate that fan wiki pages can still be useful resources
if past versions of them are accessible to users. Ours is a
structural solution that can be combined with content-based
solutions in the future.

Almeida, Mozafari, and Cho produced one of the first studies
of the behavior of contributors to Wikipedia [§]. The
authors discover that there are distinct groups of Wikipedia
contributors. They suggest that as the number of articles
increases, the contributors’ attention is split among more
and more content, resulting in a larger number of revising
contributors than article creators. This informs our
notion of the number of edits as a surrogate for the popularity
of a page.

Additionally, there has been some effort toward preserving wiki
pages outside of the Internet Archive. Popitsch, Mosser,
and Philipp have created the UROBE project for archiving
wiki representations in a generic format that can then
be reconstituted into many other formats for data analysis
[17]. Interestingly, they anticipate attaching their process to
Memento at some point later in their research so that past
versions of their archives can be accessed by datetime.

3. MEMENTO TIMEGATE HEURISTICS

Figure 4: Demonstration of the mindist and minpast
heuristics; $m_3@t_{10}$ is chosen by mindist whereas
$m_2@t_7$ is chosen by minpast

When the user selects a desired datetime prior to the episode
they have not yet seen, the TimeGate is what determines
which memento they are redirected to. In the case of spoilers,
the wrong heuristic can redirect the user to a spoiler
even though they requested a datetime prior to the event
that would have caused the spoiler.

Memento TimeGates accept two arguments from the user: 
desired datetime (specified in the Accept-Datetime header) 
and a URI-R; and they return the best URI-M using some 
heuristic |2 . RFC 7089 leaves the heuristic of finding the 
best URI-M up to the implementor, stating that “the exact 
nature of the selection algorithm is at the server’s discretion 
but is intended to be consistent” [21]. Figure 4 shows the
differences between the mindist and minpast heuristics used
for TimeGates.
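
As a minimal sketch of the client side of datetime negotiation, the snippet below builds the Accept-Datetime header that would accompany a request to a TimeGate. The TimeGate URI and the helper name accept_datetime_header are hypothetical, and no request is actually sent; only the header name and its HTTP-date format are taken from RFC 7089.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(desired: datetime) -> dict:
    """Build the request header carrying the desired datetime.

    RFC 7089 expresses Accept-Datetime as an HTTP date in GMT.
    """
    if desired.tzinfo is None:
        # Assumption for this sketch: a naive datetime is already in UTC.
        desired = desired.replace(tzinfo=timezone.utc)
    desired = desired.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(desired, usegmt=True)}


# Hypothetical TimeGate (URI-G); a real one depends on the archive or wiki.
TIMEGATE = "http://example.org/timegate/http://example.org/wiki/Some_Page"

# A desired datetime one day before an episode that aired on 2011-04-17.
headers = accept_datetime_header(datetime(2011, 4, 16, tzinfo=timezone.utc))
```

A client would send these headers to the TimeGate and follow the redirect it returns to the selected URI-M; which memento that is depends on the heuristic discussed next.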

Mindist (minimum distance) finds the closest memento to
the given desired datetime $t_a$. Mindist is best used for web
archives, which are typically sparse, meaning they may have
missed many revisions of a page. In this case, a user would
want the closest memento they can get to the date they
are requesting because the dates of capture may be wildly
distant from one another. Because it may
choose mementos from a date after the desired datetime,
mindist is not a reliable heuristic for avoiding spoilers.

This heuristic is useful in cases where there are few mementos
recorded for a web page. Consider an example where
only two mementos exist, from 2003 and 2009. If the user
wishes to see the page as it looked in 2008, the 2009 (minimum
distance) memento is likely best. Most web archives
are sparse, hence mindist is used to satisfy the majority of
use cases. This heuristic is what the Wayback Machine uses,
and it is not user-configurable.

Minpast, short for minimum distance in the past, finds the
closest memento to the desired datetime $t_a$, but without
going over $t_a$. Minpast is best used for archives that are
abundant with mementos. Ideally, minpast should be used if every
revision of a resource has been archived, as with wikis. For
wikis, the value of desired datetime $t_a$ corresponds to a
revision that actually existed at the time of $t_a$. For web archives
that are not abundant, information may be lost because they
may not have captured all revisions. Minpast can be used to
avoid spoilers. If we select a value for $t_a$ prior to the event
we want to avoid, then minpast will not find any mementos
after $t_a$. It is best used for wikis where we have access to all
revisions because we can definitively state that the memento
returned is the page as it existed at $t_a$.
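
The difference between the two heuristics can be sketched in a few lines. This is an illustration under stated assumptions, not the code of any TimeGate implementation: the function names are ours, mementos are represented only by their datetimes, and ties in mindist are broken in favor of the first memento in the list.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Optional, Sequence


def mindist(mementos: Sequence[datetime], t_a: datetime) -> Optional[datetime]:
    """Return the memento datetime closest to t_a, in either direction."""
    if not mementos:
        return None
    return min(mementos, key=lambda t_m: abs(t_m - t_a))


def minpast(mementos: Sequence[datetime], t_a: datetime) -> Optional[datetime]:
    """Return the latest memento datetime that does not go over t_a."""
    ordered = sorted(mementos)
    i = bisect_right(ordered, t_a)  # number of mementos at or before t_a
    return ordered[i - 1] if i > 0 else None


# The sparse-archive example from the text: captures from 2003 and 2009,
# and a user who wants the page as it looked in 2008.
captures = [datetime(2003, 1, 1), datetime(2009, 1, 1)]
desired = datetime(2008, 1, 1)
chosen_by_mindist = mindist(captures, desired)  # the 2009 capture
chosen_by_minpast = minpast(captures, desired)  # the 2003 capture
```

With mindist the user lands a year in the future of the desired datetime, which is exactly the situation that produces spoilers; minpast never returns a memento later than $t_a$ and returns nothing when no earlier memento exists.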

Figure 5: Example of pre-archive spoiler areas
(shown in light red) created using the mindist
heuristic; the overlap of the spoiler areas for
episodes $e_3$ and $e_2$ is shown in darker red.

Figure 6: Example of an archive-extant spoiler area
(shown in light red) created by using the mindist
heuristic; $h$ is the midpoint between $m_{k-1}$ and $m_k$


4. SPOILER PROBABILITIES 

By studying mindist using wiki revisions and the mementos 
corresponding to them, we can measure the probability of 
encountering a spoiler for a given wiki page in web archives. 

The set of datetimes where the user is redirected to a
memento after the episode, even though they chose a datetime
prior to the episode, is defined as a spoiler area.

The set of datetimes where the user is directed to a spoiler,
even though they chose a datetime prior to the episode they
are avoiding, and where the web archive has not yet started
archiving the resource, is referred to as a pre-archive spoiler
area. Figure 5 shows two pre-archive spoiler areas. Such a
spoiler area is created if the user tries to select a datetime
prior to episode $e_3@t_{11}$, but the mindist heuristic delivers
them to $m_1@t_{14} \equiv r_j@t_{13}$, which is after $e_3@t_{11}$. The
user intended to avoid spoilers for episode $e_3$, but got them
nonetheless because the archive’s earliest memento is after
the desired datetime.

So, for a pre-archive spoiler area to exist, the following
conditions must be present:

1. The TimeGate for the resource uses the mindist heuristic

2. We have access to all revisions of a given resource

3. The Memento-Datetimes for all revisions of a
resource are defined and known

4. Event $e$ must occur prior to the first memento recorded
in the archive

5. Event $e$ must occur prior to revision $r_j$ corresponding
to the first memento $m_1$ (i.e., $r_j \equiv m_1 \wedge t_e < t_{r_j}$)

Given episodes $e_1$ to $e_i$, which occur just prior to the first
archived revision $r_j \equiv m_1$, this gives us the definition of a
pre-archive spoiler area for episode $e_i$, defined by function $S_a$
over the interval beginning at start datetime $t_s$ and ending at
finish datetime $t_f$, produced by Equation 1:

\[
[t_s, t_f] = S_a(e_i) =
\begin{cases}
(t_{e_1}, t_{e_i}) & \text{if } t_{e_i} < t_{r_j} \wedge r_j \equiv m_1 \\
(0, 0) & \text{otherwise}
\end{cases}
\tag{1}
\]
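
The pre-archive spoiler area definition above can be sketched as a small function. This is an illustration under stated assumptions: datetimes are represented as plain numbers (timeline ticks or epoch seconds), the function name is ours, and the caller is assumed to pass the datetime of the revision that the first memento $m_1$ captured, which is how the condition $r_j \equiv m_1$ is encoded here.

```python
from typing import Tuple

Interval = Tuple[float, float]
EMPTY: Interval = (0, 0)  # the "otherwise" case of the definition


def pre_archive_spoiler_area(t_e1: float, t_ei: float, t_rj: float) -> Interval:
    """Sketch of S_a for episode e_i.

    t_e1 -- datetime of the first episode e_1
    t_ei -- datetime of the episode e_i being avoided
    t_rj -- datetime of revision r_j, the revision captured by memento m_1
    """
    if t_ei < t_rj:
        return (t_e1, t_ei)
    return EMPTY


# Ticks in the spirit of Figure 5: e_3 at t_11 and r_j at t_13.
# The tick for e_1 (t_3) is made up for this example.
area = pre_archive_spoiler_area(t_e1=3, t_ei=11, t_rj=13)
```

Note that for the first episode itself the interval collapses to a single point, so it contributes no width to the spoiler area.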

Figure 6 shows an archive-extant spoiler area. Let a
user select a datetime prior to $e_i@t_{11}$. To avoid spoilers, the
user needs to be directed to memento $m_{k-1}$ corresponding
to revision $r_{j-1}$.

Unfortunately, if the user selects a datetime in the area
between $t_9$ and $e_i@t_{11}$, mindist will return memento $m_k@t_{13}$,
even though they chose a datetime prior to $t_{11}$. Memento
$m_k@t_{13} \equiv r_j@t_{12}$, and $r_j$ exists after the datetime $t_{11}$ that
the user was trying to avoid. The user chose a datetime
prior to the episode containing spoilers, but is
redirected to a memento containing spoilers anyway.

Why is this a spoiler area? Remember that mindist finds the
minimum distance between the time $t_a$ specified by the user
and any given memento. In Figure 6 we have mementos
$m_{k-1}@t_5$ and $m_k@t_{13}$. We denote the midpoint between
mementos as $h$ (for halfway); here $h$ falls at $t_9$. This means
that any value $t_a$ such that $t_9 < t_a < t_{13}$ will produce
memento $m_k$ and any value $t_a$ such that $t_a < t_9$ will produce
memento $m_{k-1}$.

So, for an archive-extant spoiler area to exist, the following
conditions must be present:

1. The TimeGate for the resource uses the mindist heuristic

2. We have access to all revisions of a given resource

3. The Memento-Datetimes for all revisions of a
resource are defined and known

4. Event $e$ must occur between the Memento-Datetimes of
two consecutive mementos $m_{k-1}$ and $m_k$
(i.e., $t_{m_{k-1}} < t_e < t_{m_k}$)

5. Event $e$ must occur prior to revision $r_j$ corresponding
to memento $m_k$ (i.e., $r_j \equiv m_k \wedge t_e < t_{r_j}$)

6. The midpoint $t_h$ between $m_{k-1}$ and $m_k$ must occur
prior to event $e$ (i.e., $t_{m_{k-1}} < t_h < t_e < t_{m_k}$)
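
The conditions above can be sketched as a check on four datetimes. This is an illustration rather than the authors' code: the function name is ours, datetimes are plain numbers, and the caller is assumed to pass the revision that memento $m_k$ actually captured. The example uses tick values as we read them from the description of Figure 6.

```python
from typing import Tuple

Interval = Tuple[float, float]
EMPTY: Interval = (0, 0)


def archive_extant_spoiler_area(
    t_m_prev: float, t_m_k: float, t_rj: float, t_e: float
) -> Interval:
    """Return (t_h, t_e) when the conditions hold, otherwise (0, 0).

    t_m_prev -- Memento-Datetime of m_{k-1}
    t_m_k    -- Memento-Datetime of m_k
    t_rj     -- datetime of revision r_j captured by m_k
    t_e      -- datetime of the episode being avoided
    """
    t_h = (t_m_prev + t_m_k) / 2  # midpoint h between the two mementos
    # Condition 6 (which implies condition 4): t_{m_{k-1}} < t_h < t_e < t_{m_k}
    # Condition 5: the episode precedes the revision captured by m_k.
    if t_m_prev < t_h < t_e < t_m_k and t_e < t_rj:
        return (t_h, t_e)
    return EMPTY


# Figure 6 ticks: m_{k-1} at t_5, m_k at t_13, r_j at t_12, episode at t_11.
area = archive_extant_spoiler_area(t_m_prev=5, t_m_k=13, t_rj=12, t_e=11)
```

Any desired datetime between the midpoint and the episode is closer to $m_k$ than to $m_{k-1}$, so mindist sends the user past the episode.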

Figure 7: Example of a potential spoiler zone,
stretching from $t_{e_1}$ to $t_{e_n}$


Given consecutive mementos $m_{k-1}$ and $m_k$, the midpoint
$t_h$ between them, and revision $r_j \equiv m_k$, this gives us the
definition of an archive-extant spoiler area defined by
function $S_b$ over the interval beginning at start datetime $t_s$ and
ending at finish datetime $t_f$, produced by Equation 2:

\[
[t_s, t_f] = S_b(e) =
\begin{cases}
(t_h, t_e) & \text{if } t_h < t_e < t_{r_j} \wedge r_j \equiv m_k \wedge t_h = \dfrac{t_{m_{k-1}} + t_{m_k}}{2} \\
(0, 0) & \text{otherwise}
\end{cases}
\tag{2}
\]

So, how does one handle multiple episodes? What does
that mean for our spoiler areas? For a given resource, using
mindist, what is the chance of attempting web time travel
with Memento and getting a spoiler?

First we define a potential spoiler zone across the length
of the series we are looking at. The start datetime of the
potential spoiler zone is $t_{e_1}$, the datetime of the first episode.
The end datetime of our potential spoiler zone is $t_{e_n}$, the
datetime of the last (or latest) episode. We assume that
a user searching for datetimes prior to the first event $e_1$
should get no spoilers, so that is the lower bound. We also
assume that no additional spoilers can be revealed after the
last event $e_n$. This provides a single area in which we can
determine the probability of getting a spoiler for a single
episode in the series. Figure 7 shows an example of such a
zone.

Figure 8 shows a spoiler area ($t_4$ to $t_5$) inside a potential
spoiler zone ($t_{e_1}$ to $t_{e_n}$). Consider randomly choosing a
desired datetime within this zone. What is the probability of
landing inside the spoiler area for a given episode $e$?

Probability is defined as the number of times something can
occur divided by the total number of outcomes [23]. The
smallest unit of datetime on the web is the second. We
cannot gain more precision because HTTP
headers (and hence Memento-Datetimes) use the second as
the smallest unit. Consider iterating through every second
between $e_1$ and $e_n$, incrementing the value of counter $s$ for
each second that falls within a spoiler area. If we let $c$ be the
number of seconds between $e_1$ and $e_n$, then the probability
of encountering a spoiler is given by Equation 3:

\[
\Pr(\text{spoiler}) = \frac{s}{c}
\tag{3}
\]

Figure 8: Example of a spoiler area (light red area)
for episode $e_i$ inside potential spoiler zone (dotted
red rectangle), stretching from $t_{e_1}$ to $t_{e_n}$

Once we have determined the probability of encountering
a spoiler for a resource within the Internet Archive, we can
then use that probability to compare that resource to others.
In this way we can determine how safe a given URI is for
users who want to avoid spoilers using the Wayback Machine
or a Memento TimeGate that uses the mindist heuristic.


5. MEASURING SPOILER PROBABILITY IN POPULAR WIKIS

We selected 16 fan wikis based on television shows for our
experiment. Table 1 shows some of the details for each fan wiki.
Each television show selected has had at least two seasons
and has a currently active wiki. House of Cards was chosen
because an entire season is released on Netflix in a single day,
making it different from network television shows. Lost
was chosen because its wiki, Lostpedia, has undergone
academic study [15], and is the oldest and largest fan wiki under
consideration. We used a process, simplified in Algorithm 1,
to process each wiki and identify the spoiler areas created
by mindist. Episode dates were supplied by epguides.com.

Utilizing this method, we computed additional statistics based
on the revisions, mementos, the memento-revision mapping,
and the spoiler areas.
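
The probability of encountering a spoiler within a potential spoiler zone can be sketched as follows. Instead of literally iterating over every second, this version measures the union of the spoiler areas in seconds, which counts each second at most once and so should give the same $s$; the function name, the integer-seconds representation, and the example numbers are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # (start, finish) in whole seconds


def spoiler_probability(areas: Iterable[Interval], t_e1: int, t_en: int) -> float:
    """Pr(spoiler) = s / c over the potential spoiler zone [t_e1, t_en]."""
    c = t_en - t_e1  # total number of seconds in the zone
    if c <= 0:
        return 0.0
    # Clip every area to the zone and drop empty ones such as (0, 0).
    clipped: List[Interval] = sorted(
        (max(start, t_e1), min(finish, t_en))
        for start, finish in areas
        if min(finish, t_en) > max(start, t_e1)
    )
    s = 0
    covered_until = t_e1
    for start, finish in clipped:
        start = max(start, covered_until)  # skip seconds already counted
        if finish > start:
            s += finish - start
            covered_until = finish
    return s / c


# Hypothetical 100-second zone with two overlapping areas and one empty area.
p = spoiler_probability([(10, 30), (20, 50), (0, 0)], t_e1=0, t_en=100)
```

Here the two areas cover seconds 10 through 50 once, so $s = 40$ and $c = 100$.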


Out of the 40,868 wiki pages processed for this experiment,
we discovered that many of them were wiki redirects.
Redirects are used to deal with articles that can be referred to by
multiple names. Sometimes wiki editors may not know the
real name of an introduced character until much later, and
will use a redirect from the old name to the new. Sometimes
wiki editors will create pages not knowing that one already
exists, leaving future editors to create a redirect now that
they know that a new page title was desired. Because of
the number of redirects that contained only a single
revision and only a single memento, we removed the redirects
from consideration for calculation of spoiler areas and other
statistics. This removed 16,394 pages from consideration,
leaving us with 24,474 pages to process.

The wiki XML exports were downloaded at a different time 
than the TimeMaps for those wiki pages. To overcome this 



































Table 1: Fan wikis used in the spoiler areas experiment

Television Show (Network)     Wiki URI                          # of Pages  $t_{r_1}$   $t_{e_1}$   % of Pages in Internet Archive
----------------------------  --------------------------------  ----------  ----------  ----------  ------------------------------
The Big Bang Theory (CBS)     bigbangtheory.wikia.com           1120        2007-12-14  2007-09-24  68.8%
Boardwalk Empire (HBO)        boardwalkempire.wikia.com         2091        2010-03-18  2010-08-23  80.6%
Breaking Bad (A&E)            breakingbad.wikia.com             998         2009-04-27  2008-01-20  76.0%
Continuum (Showcase)          continuum.wikia.com               258         2012-11-13  2012-05-27  86.8%
Downton Abbey (BBC)           downtonabbey.wikia.com            784         2010-10-04  2010-09-26  53.1%
Game of Thrones (HBO)         gameofthrones.wikia.com           3144        2010-06-24  2011-04-17  75.8%
Grimm (NBC)                   grimm.wikia.com                   1581        2010-04-14  2011-10-28  57.5%
House of Cards (Netflix)      house-of-cards.wikia.com          251         2013-01-11  2013-02-01  97.2%
How I Met Your Mother (CBS)   how-i-met-your-mother.wikia.com   1709        2008-07-21  2005-09-19  58.7%
Lost (ABC)                    lostpedia.wikia.com               18790       2005-09-22  2004-09-22  39.1%
Mad Men (AMC)                 madmen.wikia.com                  652         2009-07-25  2007-06-03  85.0%
NCIS (CBS)                    ncis.wikia.com                    5345        2006-09-25  2003-09-23  93.2%
Once Upon A Time (ABC)        onceuponatime.wikia.com           1470        2011-08-09  2011-10-23  79.9%
Scandal (ABC)                 scandal.wikia.com                 331         2011-06-07  2012-04-05  82.8%
True Blood (HBO)              trueblood.wikia.com               1838        2008-10-06  2008-09-07  74.1%
White Collar (USA)            whitecollar.wikia.com             506         2009-10-30  2009-10-23  79.1%



Table 2: Spoiler probabilities for most popular pages within each fan wiki

Wiki                   Page Name           Probability  # of Spoiler  # of       # of
                                           of Spoiler   Areas         Revisions  Mementos
---------------------  ------------------  -----------  ------------  ---------  --------
bigbangtheory          Sheldon Cooper      0.31         69            1958       30
boardwalkempire        Nucky Thompson      0.15         31            290        15
breakingbad            Walter White        0.43         40            882        20
continuum              Keira Cameron       0.54         21            104        5
downtonabbey           Sybil Branson       0.42         23            580        3
gameofthrones          Daenerys Targaryen  0.16         24            768        29
grimm                  Nick Burkhardt      0.39         30            795        5
house-of-cards         Frank Underwood     0.0          13            380        3
how-i-met-your-mother  Barney Stinson      0.55         120           588        13
lostpedia              Kate Austen         0.67         94            3531       27
madmen                 Mad Men Wiki        0.22         36            250        85
ncis                   Abigail Sciuto      0.67         182           404        11
onceuponatime          Emma Swan           0.36         34            1210       11
scandal                Main Page           0.60         31            250        14
trueblood              Eric Northman       0.28         47            931        14
whitecollar            Neal Caffrey        0.29         38            199        8



Table 3: Statistics for each fan wiki

                       Probability of Spoiler     Revisions/Day              Mementos/Day
Wiki                   Mean   Std Dev  Rel Err    Mean    Std Dev  Rel Err   Mean    Std Dev  Rel Err
---------------------  -----  -------  -------    ------  -------  -------   ------  -------  -------
bigbangtheory          0.667  0.160    0.0116     0.0506  0.0668   0.0639    0.0033  0.0034   0.0488
boardwalkempire        0.417  0.170    0.0160     0.0102  0.0185   0.0718    0.0022  0.0026   0.0452
breakingbad            0.746  0.205    0.0127     0.0185  0.0351   0.0872    0.0032  0.0032   0.0459
continuum              0.394  0.177    0.0471     0.0317  0.0250   0.0829    0.0051  0.0023   0.0479
downtonabbey           0.585  0.174    0.0196     0.0374  0.0636   0.1124    0.0020  0.0013   0.0419
gameofthrones          0.473  0.248    0.0122     0.0425  0.0652   0.0356    0.0041  0.0049   0.0279
grimm                  0.479  0.175    0.0201     0.0700  0.0857   0.0672    0.0027  0.0015   0.0305
house-of-cards         0.006  0.035    0.6705     0.0772  0.1364   0.2082    0.0075  0.0044   0.0687
how-i-met-your-mother  0.741  0.100    0.0046     0.0163  0.0220   0.0463    0.0014  0.0010   0.0263
lostpedia              0.768  0.163    0.0027     0.0391  0.1083   0.0348    0.0040  0.0055   0.0173
madmen                 0.530  0.144    0.0133     0.0049  0.0076   0.0764    0.0014  0.0021   0.0755
ncis                   0.818  0.107    0.0041     0.0073  0.0097   0.0413    0.0009  0.0008   0.0279
onceuponatime          0.516  0.163    0.0132     0.1271  0.1327   0.0437    0.0037  0.0025   0.0281
scandal                0.591  0.165    0.0269     0.0418  0.0484   0.1120    0.0030  0.0019   0.0608
trueblood              0.517  0.162    0.0106     0.0210  0.0410   0.0658    0.0016  0.0016   0.0345
whitecollar            0.390  0.250    0.0500     0.0117  0.0147   0.0986    0.0019  0.0015   0.0609
Overall                0.659  0.226    0.0029     0.0362  0.0871   0.0200    0.0032  0.0044   0.0114















































































FindSpoilerAreasInWikis(episodeList, wikiURI)
    episodeTimes = getEpisodeTimes(episodeList)
    wikiTitles = getPageTitles(wikiURI)
    for each title ∈ wikiTitles
        wikidump = fetchXMLdump(title, wikiURI)
        revisions = extractRevisionTimes(wikidump)
        timemapURI = makeTMURI(wikiURI, title)
        timemap = fetchTimeMap(timemapURI)
        mementos = extractMementoTimes(timemap)
        mementoRevisionMap = mapRevsToMems(revisions, mementos)
        for each episode ∈ episodeTimes
            paSpoilerArea = S_a(episode, mementoRevisionMap)
            aeSpoilerArea = S_b(episode, mementoRevisionMap)
            spoilerAreaList.append(paSpoilerArea)
            spoilerAreaList.append(aeSpoilerArea)
        mapPageToSpoilers(wikipageSpoilerMap, title, spoilerAreaList)
    return wikipageSpoilerMap


Algorithm 1: Algorithm for spoiler probability experiment
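
The memento-revision mapping step of Algorithm 1 is not spelled out in the text, so the sketch below shows one plausible way to build it. It assumes that a memento captures the newest revision saved at or before the memento's capture datetime; that rule, the function name, and the tick values are our assumptions, not details taken from the experiment.

```python
from bisect import bisect_right
from typing import Dict, List, Optional


def map_revs_to_mems(
    revisions: List[int], mementos: List[int]
) -> Dict[int, Optional[int]]:
    """Map each memento datetime to the revision datetime it captured.

    Assumption: a memento captures the newest revision saved at or before
    its capture datetime; None means the capture predates every revision.
    """
    ordered = sorted(revisions)
    mapping: Dict[int, Optional[int]] = {}
    for t_m in mementos:
        i = bisect_right(ordered, t_m)  # revisions saved at or before t_m
        mapping[t_m] = ordered[i - 1] if i > 0 else None
    return mapping


# Hypothetical timeline ticks: revisions at 2, 4, 6, 8, 12; mementos at 5, 13.
memento_revision_map = map_revs_to_mems([2, 4, 6, 8, 12], [5, 13])

# Revisions that no memento captured are the missed updates.
missed_updates = sorted(
    set([2, 4, 6, 8, 12]) - set(memento_revision_map.values())
)
```

In this example the memento at tick 5 captures the revision from tick 4, the memento at tick 13 captures the revision from tick 12, and the revisions at ticks 2, 6, and 8 are missed updates.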


Figure 9: Spoiler areas for the most popular page
(3,531 revisions) in our data set (plot title: Spoiler Areas for
http://lostpedia.wikia.com/wiki/Kate_Austen)

[Plot title for Figure 10: Spoiler Areas for
http://bigbangtheory.wikia.com/wiki/Sheldon_Cooper]


inconsistency, any mementos in TimeMaps that existed after 
the wiki page was downloaded were discarded. 

Of the 24,474 pages processed, only 15,119 pages actually 
had TimeMaps at the Internet Archive at the time the wiki 
exports were extracted. This means that roughly 38% of the 
pages under consideration were not available in the Internet 
Archive. 
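
The page counts above can be checked with a few lines of arithmetic. The variable names are ours; the numbers are the ones reported in the text.

```python
pages_processed = 40_868      # wiki pages across the 16 wikis
redirects_removed = 16_394    # redirect pages dropped from the analysis
pages_considered = pages_processed - redirects_removed

pages_with_timemaps = 15_119  # pages that had TimeMaps at the Internet Archive

fraction_unavailable = 1 - pages_with_timemaps / pages_considered
percent_unavailable = round(100 * fraction_unavailable)
```

The subtraction reproduces the 24,474 pages under consideration, and the unavailable fraction rounds to the 38% quoted in the text.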



Figure 9 shows our spoiler area graph for the page with the
most revisions in our entire dataset, a page from Lostpedia
about the character Kate Austen. Each spoiler area is shown
in red using an alpha channel that gives it some degree of
transparency. When these transparent red areas stack up,
the red gets darker, so we cannot reliably see all of
the pre-archive spoiler areas that exist prior to the first
memento. The probability of encountering a spoiler for Kate’s
page is 67%, calculated by Equation 3. Because this page
only has a few mementos around 2009 and then a long break
for the Internet Archive until 2011, there are a few
archive-extant spoiler areas, also shown in red, both around the
2009 mark and after the memento halfway mark in 2010.

Figure 10 shows spoiler areas for the page about the Big
Bang Theory character Sheldon Cooper. The Internet Archive
was more aggressive at archiving in 2008 than it was during
the run of the show Lost (starting 2004), so there are only 8
pre-archive spoiler areas for this page, compared to Kate’s
86. There are, however, 61 archive-extant spoiler areas,
compared with Kate’s 8. Sheldon’s page has a spoiler
probability of only 31%. We can see the clusters of points indicating
each episode on the events timeline. Because television show
seasons occur during portions of the year, we can see the
seasons, and partial seasons, for Big Bang Theory on the top.
Even though Sheldon’s page contains quite a few spoiler
areas after the second season, there appears to be a block of
time before the third season where one is safe to browse this
page and avoid spoilers.

Figure 10: Spoiler areas for the most popular page
(1,958 revisions) in the Big Bang Theory Wiki

Figure 11: Spoiler areas for the most popular page
(768 revisions) in the Game of Thrones Wiki (plot title:
Spoiler Areas for
http://gameofthrones.wikia.com/wiki/Daenerys_Targaryen)

Figure 12: Histogram of spoiler probabilities for all
16 wiki sites (plot title: Histogram of Overall Spoiler
Probabilities from 16 Wikia Sites; x-axis: Spoiler Probability)

Figure 13: Graph of the cumulative distribution
function of spoiler probabilities for all 16 wiki sites (plot
title: Cumulative Distribution Function of Probabilities of
Encountering a Spoiler Using Mindist on Mementos From 16
Wikia Sites; x-axis: Probability of Encountering a Spoiler)

Figure 14: Visualization of missed updates; darker
colors represent more missed updates

[Plot title: History of Redundant Mementos Per Day
For 15000+ Wiki Pages]


Figure 11 provides another example of a more current show,
using a page from the Game of Thrones Wiki.

Table 2 contains statistics for the most popular page in each
of the wikis that we have surveyed, where popularity is
determined by the number of page revisions generated. Because
these wikis are authored by fans, readers familiar with
many of these television shows will not be surprised that
most of the popular pages are about main characters. The
table also lists the number of spoiler areas, revisions, and
mementos, showing that there is no simple relationship among
these values that indicates the probability of encountering a
spoiler.

Of particular interest is the television show House of Cards.
Because it releases an entire season of episodes at one time,
our model breaks down. We count 13 pre-archive spoiler areas
for the first season, and then no archive-extant spoiler
areas. The pre-archive spoiler areas have no size because
all of them begin and end at the same time. This leads
to a 0% chance of encountering a spoiler in this wiki, seeing
as each season is released like a 13-hour movie rather than
on a weekly basis. In this case, time cannot differentiate
between individual episodes because
$t_{e_1} = t_{e_2} = \cdots = t_{e_{13}}$.
A new dimension is required to order otherwise simultaneous
events. A different situation exists with another
Netflix series, Arrested Development, in which all episodes
for a season are released at once, but the episodes do not
need to be viewed in any particular order, making it
difficult to identify when spoilers would occur.

Figure 15: Visualization of redundant mementos;
darker colors represent more redundant mementos

Table H shows the statistics for each fan wiki. We see a 
mean overall spoiler probability of 66%. We also see that 
the number of mementos per day is an order of magnitude 
smaller than the number of revisions per day. 

Figure 12 shows the probability distribution of encountering
spoilers in these wiki pages. Figure 13 shows a cumulative
distribution function of spoiler probabilities for all wikis
within the data set. Here we see that most of the pages
carry some nonzero probability of encountering a spoiler.

Figure 14 shows the number of missed updates encountered
for each datetime over the history of all pages in the wiki.
The Y-axis represents each URI in the data set. The X-axis
is time. Lighter colors indicate fewer missed updates on that
day. Of interest are the vertical lines seen throughout the
visualization. The datetimes for these lines correspond to
changes in policy at the Internet Archive. In 2009 and in
late 2011, the Internet Archive reduced its quarantine period
for archiving of new pages. In October of 2013, the Internet
Archive released the Save Page Now feature [18], leading
to fewer missed updates after that point.

Figure 15 shows the number of redundant mementos created
for each datetime over the history of all pages in the wiki.
Just as with Figure 14, the Y-axis represents each URI and
the X-axis is time. As expected, the number of redundant
mementos increases as the Internet Archive becomes more
aggressive about archiving web pages.

6. MEASURING NAIVE SPOILERS IN 
WAYBACK MACHINE LOGS 

Ainsworth has already studied how much drift exists within
web archives [1]. That study indicates
that the Wayback Machine uses a sliding target policy.
This means that each request is in some way based on the
datetime of the last request, so a user can end up at
a much different datetime than the one they originally
started with. The Wayback Machine still uses the mindist
heuristic to determine which memento to deliver to a user,
but it changes the desired datetime $t_a$ based on the
datetime of the memento from the last request.

In contrast, Memento uses a sticky target policy,
allowing a user to fix the datetime $t_a$ throughout their
browsing session. While the sparsity of the archives
introduces some small drift with the sticky target policy,
it is constrained by the datetime remaining constant in each
request. That drift is introduced only by the mindist
heuristic rather than the sliding behavior of the Wayback
Machine.
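
The difference between the two policies, and between mindist and minpast, can be illustrated with a minimal Python sketch. The function `browse` and the memento lists in the usage note are hypothetical and serve only as illustration; `minpast` follows the description of using the desired datetime as an upper bound, and is not taken from any TimeGate implementation.

```python
from datetime import datetime


def mindist(mementos, desired):
    """Return the memento closest to the desired datetime, past or future."""
    return min(mementos, key=lambda m: abs(m - desired))


def minpast(mementos, desired):
    """Return the closest memento at or before the desired datetime, or None."""
    past = [m for m in mementos if m <= desired]
    return max(past) if past else None


def browse(pages, desired, sliding):
    """Visit each page in turn, choosing a memento with mindist.

    pages   -- list of memento-datetime lists, one list per page visited
    sliding -- if True, the target moves to the last memento delivered
               (sliding target); if False, it stays fixed (sticky target)
    """
    target = desired
    visited = []
    for mementos in pages:
        chosen = mindist(mementos, target)
        visited.append(chosen)
        if sliding:
            target = chosen
    return visited
```

With three hypothetical pages whose mementos are sparse, a session starting at April 1, 2009 stays within about two months of that date under the sticky policy, while under the sliding policy it ends at January 1, 2010.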

We are concerned with whether the user ended up at a
datetime later than the one they intended. We want to know if
they encountered a spoiler when using the Wayback Machine. We
conducted a study using anonymized Wayback Machine
logs spanning January 1, 2011 through March 10, 2011 and
August 1, 2011 through March 26, 2012.

The logs from the Wayback Machine are in Apache common
log format. Using the referrer for each request, we can track
where the user came from and determine where they ended
up. Fortunately for us, we can infer the desired datetime
(referred to as $t_a$ previously) and the memento-datetime
from the URIs themselves. The Internet Archive allows access
to all mementos using a standard URI format, and the datetime
is embedded in the URI. For the URI visited by the user, this
datetime indicates the memento-datetime. For the referrer
URI, this datetime indicates their desired datetime.
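
As a rough sketch of how a datetime might be pulled out of such a URI, the Python function below assumes the familiar Wayback Machine layout, in which a 14-digit `YYYYMMDDhhmmss` timestamp follows `/web/`. The exact URI prefix used in the logs of this study is not stated here, so the pattern and the name `get_date` are assumptions made for illustration.

```python
import re
from datetime import datetime

# Assumed layout: .../web/<14-digit timestamp>[optional flags]/<original URI>
WAYBACK_TIMESTAMP = re.compile(r"/web/(\d{14})[a-z_]*/")


def get_date(uri):
    """Return the datetime embedded in a Wayback-style URI, or None."""
    match = WAYBACK_TIMESTAMP.search(uri)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    except ValueError:
        # Fourteen digits that do not form a valid date and time.
        return None
```

Applied to the visited URI this yields the memento-datetime, and applied to the referrer it yields the inferred desired datetime.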


FINDSPOILERSINLOGFILE(logfile)
    for each visitorID, visitedURI, referrer ∈ logfile
        t_m = GETDATE(visitedURI)
        t_a = GETDATE(referrer)
        wikidump = FETCHXMLDUMP(title, wikiURI)
        revisions = EXTRACTREVISIONTIMES(wikidump)
        t_r = GETREVMATCHINGMEMENTO(t_m, revisions)
        spoiler = INDETERMINATE
        if t_r is not NULL
            spoiler = (t_a < t_r)
        print(visitorID + "," + spoiler)


Algorithm 2: Algorithm for detecting spoilers in Internet
Archive logs


Why do we say that we can infer the desired datetime?
Without interviewing the visitors to the Wayback Machine,
it is impossible to determine intent, and because the logs
are anonymized, such interviews are not possible. We are
making the assumption that some of the users receiving
these responses intended to receive responses for the date
that they started at, not the date delivered by the drift
caused by the mindist heuristic.

From these logs we can determine the inferred desired
datetime from the referrer URI and the memento-datetime from
the visited URI. Using this information, we can download
the wiki exports, as in the previous experiment, and
determine if the page revision recorded by the web archive
exists in the future of the desired datetime.

All requests for archived pages from wikia.com were
extracted from the logs, resulting in 1,180,759 requests. Of
those requests, we removed all requests for images,
JavaScript, style sheets, supporting wiki pages (such as
Template, Category, and Special pages), and advertisements.
This left us with 62,227 requests to review.
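
A filter of this kind could look roughly like the following Python sketch. The suffix list, the namespace list, and the `/wiki/` path test are assumptions chosen for illustration; they are not the actual rules used to arrive at the 62,227 requests.

```python
from urllib.parse import unquote, urlparse

# Hypothetical rules: file suffixes treated as images, scripts, or style sheets.
ASSET_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".ico", ".svg", ".js", ".css")
# Supporting wiki namespaces named in the text.
SUPPORT_NAMESPACES = ("Template:", "Category:", "Special:")


def is_article_request(original_uri):
    """Return True if the original (pre-archive) URI looks like a wiki article."""
    parsed = urlparse(original_uri)
    host = parsed.hostname or ""
    if not host.endswith(".wikia.com"):
        return False  # advertisements and other third-party hosts
    path = unquote(parsed.path)
    if path.lower().endswith(ASSET_SUFFIXES):
        return False  # images, JavaScript, style sheets
    if not path.startswith("/wiki/"):
        return False  # skins, image servers, and other non-article paths
    title = path[len("/wiki/"):]
    return bool(title) and not title.startswith(SUPPORT_NAMESPACES)
```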

For those remaining wikia.com pages, we downloaded the
wiki export files, as done in the previous experiment, mapped
the visited URI to the revision that it had archived, and
compared the datetime of that revision with the inferred
desired datetime. We use $t_a$ to represent the inferred
desired datetime, and $t_r$ to represent the datetime of the
wiki revision matching the visited URI in the Wayback Machine.
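
One way to sketch this mapping in Python is shown below. It assumes a MediaWiki XML export in which each `<revision>` element carries a `<timestamp>` of the form `2009-01-10T12:00:00Z`, and it assumes that the revision archived by a memento is the latest revision at or before the memento-datetime. The function names mirror Algorithm 2 but are otherwise hypothetical.

```python
import xml.etree.ElementTree as ET
from bisect import bisect_right
from datetime import datetime


def _local_name(tag):
    """Drop any XML namespace so the export schema version does not matter."""
    return tag.rsplit("}", 1)[-1]


def extract_revision_times(export_xml):
    """Return the sorted revision datetimes found in a MediaWiki export string."""
    root = ET.fromstring(export_xml)
    times = []
    for element in root.iter():
        if _local_name(element.tag) != "revision":
            continue
        for child in element:
            if _local_name(child.tag) == "timestamp" and child.text:
                times.append(
                    datetime.strptime(child.text.strip(), "%Y-%m-%dT%H:%M:%SZ"))
    return sorted(times)


def get_rev_matching_memento(t_m, revisions):
    """Return the latest revision datetime at or before t_m, or None."""
    index = bisect_right(revisions, t_m)
    return revisions[index - 1] if index else None
```

Returning `None` when no revision precedes the memento corresponds to the indeterminate case.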

Each response falls into one of three categories in terms of
spoilers: (1) spoiler - $t_a < t_r$; (2) safe - $t_a \geq t_r$;
(3) indeterminate - either the datetime for the revision or
the referrer could not be determined, likely because the
article or whole wiki was moved or no longer exists, or
because of 503 HTTP status codes due to the size of the
export file.
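
The three categories can be expressed compactly in code. The Python sketch below uses a hypothetical `Outcome` enumeration and treats equal datetimes as safe, which matches the test `t_a < t_r` used for spoilers in Algorithm 2.

```python
from datetime import datetime
from enum import Enum


class Outcome(Enum):
    SPOILER = "spoiler"
    SAFE = "safe"
    INDETERMINATE = "indeterminate"


def classify(t_a, t_r):
    """Classify one request from its desired datetime and revision datetime.

    t_a -- inferred desired datetime from the referrer, or None if unknown
    t_r -- datetime of the wiki revision that was archived, or None if unknown
    """
    if t_a is None or t_r is None:
        return Outcome.INDETERMINATE
    # The user received a revision written after the datetime they asked for.
    if t_a < t_r:
        return Outcome.SPOILER
    return Outcome.SAFE
```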

This process, shown in Algorithm 2, determines how many
requests are either spoiler, safe, or indeterminate for each
log file. Indeterminate entries make up the bulk of the data
collected, but offer no meaningful insight into the spoiler
problem, and are thus discarded. From this study we found
that roughly 19% of these requests to the Wayback Machine
result in spoilers.
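
A tally of this kind might be computed as in the Python sketch below. It assumes that the spoiler rate is taken over determinate requests only, since indeterminate entries are discarded; the string labels and the function name `spoiler_rate` are hypothetical, and the inputs in any usage are illustrative rather than data from the study.

```python
from collections import Counter


def spoiler_rate(outcomes):
    """Return the fraction of determinate requests that were spoilers.

    outcomes -- iterable of "spoiler", "safe", or "indeterminate" labels
    Returns None when every request is indeterminate.
    """
    counts = Counter(outcomes)
    determinate = counts["spoiler"] + counts["safe"]
    if determinate == 0:
        return None
    return counts["spoiler"] / determinate
```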



7. CONCLUSIONS 

We have introduced the notion of different heuristics for use
with Memento TimeGates. We have shown that the mindist
heuristic, while useful for sparse archives, is not reliably
effective for users trying to avoid spoilers with Memento. We
have also proposed minpast as a superior choice for wikis,
which have access to every revision.

We have shown that roughly 38% of the pages under
consideration were not available in the Internet Archive. We
also found that, for the wiki sites under consideration, there
is a mean 66% probability that one will end up with a spoiler
if they use TimeGates supporting the mindist heuristic. Also,
from our sample logs from the Wayback Machine, 19% of
requests to wikia.com end in spoilers. This presents a problem
for episodic fiction fans trying to use the Wayback Machine,
or the Internet Archive through Memento, to avoid spoilers.
This further demonstrates that using Memento directly on
wikis, using minpast, is better for avoiding spoilers.
APPENDIX


Spoiler Areas for http://lostpedia.wikia.com/wiki/Kate_Austen
Figure 16: Spoiler areas for the most popular page in Lostpedia

Spoiler Areas for http://bigbangtheory.wikia.com/wiki/Sheldon_Cooper
Figure 17: Spoiler areas for the page in the Big Bang Theory Wiki that contains the most revisions

Spoiler Areas for http://boardwalkempire.wikia.com/wiki/Nucky_Thompson
Figure 18: Spoiler areas for the page in the Boardwalk Empire Wiki that contains the most revisions

Spoiler Areas for http://breakingbad.wikia.com/wiki/Walter_White
Figure 19: Spoiler areas for the page in the Breaking Bad Wiki that contains the most revisions

Spoiler Areas for http://continuum.wikia.com/wiki/Kiera_Cameron
Figure 20: Spoiler areas for the page in the Continuum Wiki that contains the most revisions

Spoiler Areas for http://downtonabbey.wikia.com/wiki/Sybil_Branson

Spoiler Areas for http://gameofthrones.wikia.com/wiki/Daenerys_Targaryen
Figure 22: Spoiler areas for the most popular page in the Game of Thrones Wiki

Spoiler Areas for http://grimm.wikia.com/wiki/Nick_Burkhardt
Figure 23: Spoiler areas for the page in the Grimm Wiki that contains the most revisions

Spoiler Areas for http://house-of-cards.wikia.com/wiki/Frank_Underwood
Figure 24: Spoiler areas for the most popular page in the House of Cards Wiki

Spoiler Areas for http://how-i-met-your-mother.wikia.com/wiki/Barney_Stinson

Spoiler Areas for http://madmen.wikia.com/wiki/Mad_Men_Wiki
Figure 26: Spoiler areas for the page in the Mad Men Wiki that contains the most revisions

Spoiler Areas for http://ncis.wikia.com/wiki/Abigail_Sciuto

Spoiler Areas for http://onceuponatime.wikia.com/wiki/Emma_Swan
Figure 28: Spoiler areas for the page in the Once Upon A Time Wiki that contains the most revisions

Spoiler Areas for http://scandal.wikia.com/wiki/Main_Page
Figure 29: Spoiler areas for the page in the Scandal Wiki that contains the most revisions

Spoiler Areas for http://trueblood.wikia.com/wiki/Eric_Northman
Figure 30: Spoiler areas for the page in the True Blood Wiki that contains the most revisions

Spoiler Areas for http://whitecollar.wikia.com/wiki/Neal_Caffrey
Figure 31: Spoiler areas for the page in the White Collar Wiki that contains the most revisions